Mining Parenthetical Translations from the Web by Word Alignment
Abstract
Documents in languages such as Chinese, Japanese and Korean sometimes annotate terms with their English translations inside a pair of parentheses. We present a method to extract such translations from a large collection of web documents by building a partially parallel corpus and using a word alignment algorithm to identify the terms being translated. The method is able to generalize across the translations for different terms and can reliably extract translations that occur only once in the entire web. Our experiment on Chinese web pages produced more than 26 million pairs of translations, which is over two orders of magnitude more than previous results. We show that adding the extracted translation pairs as training data provides a significant increase in the BLEU score for a statistical machine translation system.
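As a rough illustration of the first step the abstract describes (collecting parenthetical annotations into a partially parallel corpus), the sketch below pairs each parenthesized English string with the Chinese text that precedes it. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `extract_candidates`, the regular expressions, and the `MAX_CONTEXT_CHARS` cap are all hypothetical choices made for this example.

```python
import re

# Hypothetical cap on how many preceding characters to keep as context.
MAX_CONTEXT_CHARS = 10

# English text inside ASCII or full-width parentheses (assumed pattern).
PAREN_EN = re.compile(r"[(（]\s*([A-Za-z][A-Za-z0-9 .\-']*?)\s*[)）]")

# A run of CJK Unified Ideographs ending right where the parenthesis opens.
CJK_RUN_AT_END = re.compile(r"[\u4e00-\u9fff]+$")


def extract_candidates(text, max_context=MAX_CONTEXT_CHARS):
    """Return (chinese_context, english_term) pairs found in text."""
    pairs = []
    for match in PAREN_EN.finditer(text):
        prefix = text[:match.start()]
        run = CJK_RUN_AT_END.search(prefix)
        if run is None:
            # No Chinese text directly before the parenthesis: skip it.
            continue
        # Keep only the last max_context characters of the preceding run.
        context = run.group(0)[-max_context:]
        pairs.append((context, match.group(1)))
    return pairs
```

The Chinese side is deliberately over-captured here: the preceding context usually contains more than the term being translated, which is why the abstract relies on a word alignment step to decide which part of that context corresponds to the English term. That alignment step is not shown in this sketch.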
Related resources
Mining Parenthetical Translations for Polish-English Lexica
Documents written in languages other than English sometimes include parenthetical English translations, usually for technical and scientific terminology. Techniques had been developed for extracting such translations (as well as transliterations) from large Chinese text corpora. This paper presents methods for mining parenthetical translation in Polish texts. The main difference between translati...
Semi-Supervised Lexicon Mining from Parenthetical Expressions in Monolingual Web Pages
This paper presents a semi-supervised learning framework for mining Chinese-English lexicons from a large number of Chinese Web pages. The issue is motivated by the observation that many Chinese neologisms are accompanied by their English translations in parentheses. We classify parenthetical translations into bilingual abbreviations, transliterations, and translations. A frequency-ba...
Improved Word Alignments Using the Web as a Corpus
We propose a novel method for improving word alignments in a parallel sentence-aligned bilingual corpus based on the idea that if two words are translations of each other then so should be many words in their local contexts. The idea is formalised using the Web as a corpus, a glossary of known word translations (dynamically augmented from the Web using bootstrapping), the vector space model, li...
The Bilingual Concordancer TransSearch
TRANSSEARCH is a web-based translation search engine. When a user submits a translation query, the system replies with a set of sentence pairs whose source sentence contains the query. The source expression is highlighted and, with the help of statistical word alignment techniques, the corresponding target expression is also identified. When many sentences share the same translations, the trans...
Word-aligned Parallel Text – A New Resource for Contrastive Language Studies
This paper describes the opportunities that arise from automatic word alignment for bilingual concordances and contrastive language studies. We introduce our parallel corpus of Alpine texts in French and German and our web-based alignment search system. We explain how we have reduced the number of erroneous alignments in the output by distinguishing between dominant and miscellaneous translatio...
Publication date: 2008